In [9]:
import IPython.display
IPython.display.HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[9]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

Making sense of City HR data using spaCy and sk-learn

Gordon Inggs, Data Scientist, City of Cape Town

Outline

  1. Context
  2. Transforming data into a form for analysis (NLP bit)
  3. Data-relevance Scoring (more NLP)
  4. Investigating dynamics within the data (PCA bit)

Context

Why were we doing this?

  • City of Cape Town has a Data Strategy.
  • City-wide initiative to improve how the City works with data.
  • One part of the strategy (Data Capabilities) concerns City employees.
  • Need to understand how "data-intensive" the City's work is.

Caveats

  1. Use of Formal HR data
  2. Use of pre-trained models

Transforming Data

In [10]:
IPython.display.HTML(filename='./gordon_source_df.html')
Out[10]:
Directorate Department PositionName CriteriaGroup Row RowVector AppraisalScoreWeight
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Discipline Specific Skills L3 [-0.115235664, 0.094851844, -0.032811504, -0.1... 25
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Impact and Influence L3 [-0.185014, 0.27602965, -0.020265013, -0.01785... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Organisational Awareness L3 [-0.0405184, 0.15518801, 0.110339, 0.008534556... 15
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Planning and Organising L3 [-0.03400433, 0.02099867, 0.016796663, -0.0964... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's ANALYTIC DRIVEN CULTURE [-0.01125025, 0.105551496, 0.2284475, -0.08509... 20
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA AUTOMATION [-0.24759, 0.0056599975, 0.28850502, 0.09628, ... 20
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA INSIGHT [-0.107594505, 0.18723, -0.019495003, 0.2254, ... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA REQUIREMENTS [0.006064996, -0.272295, -0.1181675, 0.003385,... 30

Cleaning

nlp = spacy.load('en_core_web_lg')
stop_words = {
    "service", "delivery",
    "function", "functions",
    "orientation", "orientations",
    "problem", "solving",
    "cfadm", "cfpro", "cfuni", "cfsup", "cfart", "cfman", "cftec",
    "kpaa", "kpan",
    "l1",  "l2", "l3", "l4", "l5"
}
nlp.Defaults.stop_words |= stop_words
hr_df.Row.apply(
    lambda x: [
        token.text.lower() 
        for token in nlp(x) 
        if not token.is_punct and not token.is_stop
    ]
)
  • Takes $\approx$ 2 mins using 16 cores.
  • Work is split into 4 chunks per core, at least 10k entries per core.
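A minimal sketch of how that chunking might look. The source does not show the parallelisation code, so the helper name `make_chunks`, the `clean_chunk` worker, and the use of `multiprocessing.Pool` are all assumptions based on the figures above:

```python
def make_chunks(items, n_workers=16, chunks_per_worker=4):
    """Split the rows into n_workers * chunks_per_worker roughly equal chunks."""
    n_chunks = n_workers * chunks_per_worker
    chunk_size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

chunks = make_chunks(list(range(100_000)))
# each chunk would then be handed to a worker process, e.g.:
# with multiprocessing.Pool(16) as pool:
#     cleaned = pool.map(clean_chunk, chunks)  # clean_chunk: hypothetical worker
```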

Embedding

row_vectors = data_df.Row.apply(
    lambda row: nlp(row.lower()).vector
)
  • Also takes $\approx$ 2 mins using 16 cores.

Reducing criteria -> positions

Using the centre-of-mass formula:

$$C = \frac{\sum_{i=1}^{N}{W_i X_i}}{\sum_{i=1}^{N}{W_i}}$$
  • $C$ - new position vector
  • $N$ - number of rows (criteria) for the position
  • $W_i$ - row $i$'s weight
  • $X_i$ - row $i$'s vector
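The reduction can be sketched in plain Python. `weighted_centroid` is a toy helper of my own naming, and the 2-D vectors stand in for the 300-D spaCy ones:

```python
def weighted_centroid(vectors, weights):
    """C = sum(W_i * X_i) / sum(W_i), applied per dimension."""
    total_weight = sum(weights)
    n_dims = len(vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total_weight
        for d in range(n_dims)
    ]

# two toy row vectors, weighted like AppraisalScoreWeights of 25 and 75
weighted_centroid([[1.0, 0.0], [0.0, 1.0]], [25, 75])
# → [0.25, 0.75]
```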
In [16]:
IPython.display.HTML(filename='./gordon_source_df.html')
Out[16]:
Directorate Department PositionName CriteriaGroup Row RowVector AppraisalScoreWeight
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Discipline Specific Skills L3 [-0.115235664, 0.094851844, -0.032811504, -0.1... 25
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Impact and Influence L3 [-0.185014, 0.27602965, -0.020265013, -0.01785... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Organisational Awareness L3 [-0.0405184, 0.15518801, 0.110339, 0.008534556... 15
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies CFPRO: Planning and Organising L3 [-0.03400433, 0.02099867, 0.016796663, -0.0964... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's ANALYTIC DRIVEN CULTURE [-0.01125025, 0.105551496, 0.2284475, -0.08509... 20
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA AUTOMATION [-0.24759, 0.0056599975, 0.28850502, 0.09628, ... 20
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA INSIGHT [-0.107594505, 0.18723, -0.019495003, 0.2254, ... 30
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's DATA REQUIREMENTS [0.006064996, -0.272295, -0.1181675, 0.003385,... 30

a few weighted averages later...

In [17]:
IPython.display.HTML(filename='./gordon_cg_df.html')
Out[17]:
Directorate Department PositionName CriteriaGroup CriteriaGroupVector
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci Competencies [-0.10059217, 0.13609965, 0.00730747, -0.06769...
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci KPA's [-0.0822269, -0.0032772017, 0.062091753, 0.070...

and a few more...

In [18]:
IPython.display.HTML(filename='./gordon_position_df.html')
Out[18]:
Directorate Department PositionName PositionVector
CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci [-0.08773648, 0.038535856, 0.045656465, 0.0293...

But what does this actually look like?!?

In [19]:
IPython.display.HTML(filename='./hr_translation_I.html')
Out[19]:
Bokeh Plot
In [20]:
IPython.display.HTML(filename='./hr_translation_II.html')
Out[20]:
Bokeh Plot

Data-relevance Scoring

Relationship to data-intensive work

data_words = [
    "data",
    "gathering",
    "processing",
    "analysis",
    "dissemination"
]
data_word_vectors = {
    word: nlp(word.lower()).vector
    for word in data_words
}
for word, word_vector in data_word_vectors.items():
    score_df[f"{word.title()}Score"] = sklearn.metrics.pairwise.cosine_similarity(
        numpy.vstack(score_df.PositionVector.values),
        numpy.array([word_vector])
    )
  • Fairly fast - a few seconds at most
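For intuition, the cosine similarity that scikit-learn computes above is just the normalised dot product; a hand-rolled sketch with toy 2-D vectors (the real comparison uses the 300-D embeddings):

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

cosine_similarity([1.0, 0.0], [1.0, 1.0])  # ≈ 0.7071
```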
In [25]:
IPython.display.HTML(filename='./data_score_df.html')
Out[25]:
Directorate Department PositionName DataScore GatheringScore ProcessingScore AnalysisScore DisseminationScore
4869 CORPORATE SERVICES Organisational Performance Management Principal Professional Officer: Data Sci 0.861595 0.411872 0.603572 0.698752 0.495789

But what does this actually look like?!?

In [23]:
IPython.display.HTML(filename='./data_scoring.html')
Out[23]:
Bokeh Plot

On validation...

  • Those affiliated with the Data Strategy are probably doing data-related work...
In [28]:
IPython.display.HTML(filename='./data_scoring_comparison.html')
Out[28]:
Bokeh Plot

Investigating Dynamics

  • Principal Component Analysis (PCA) finds the directions that explain the most variance (spread) in the dataset.
  • It remaps the data into a new, reduced-dimension form.
  • Sometimes these new dimensions have interpretable meanings.
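As a rough sketch of what PCA does under the hood (the analysis itself presumably used scikit-learn on the 300-D position vectors, which the source does not show), this toy version finds the leading principal component of 2-D points via power iteration on the covariance matrix:

```python
def pca_first_component(data, iters=200):
    """Leading principal component of 2-D points, via power iteration
    on the 2x2 covariance matrix (a minimal PCA sketch)."""
    n = len(data)
    means = [sum(p[d] for p in data) / n for d in (0, 1)]
    centred = [[p[0] - means[0], p[1] - means[1]] for p in data]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in (0, 1)]
           for i in (0, 1)]
    v = [1.0, 1.0]
    for _ in range(iters):
        v = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = [v[0] / norm, v[1] / norm]
    return v

# points varying mostly along y = x: first component ≈ [0.707, 0.707]
pca_first_component([[0, 0], [1, 1], [2, 2], [3, 3.1]])
```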
In [29]:
IPython.display.HTML(filename='./data_scoring_pca.html')
Out[29]:
Bokeh Plot

Conclusion

Key Findings

  • City job description data appears amenable to NLP analysis
  • City positions seem to have three groupings:
    • Intensive workers (the green band)
    • Majority in the middle (the grey band)
    • Low intensity/bad data (the red band)
  • 'Processing' and 'Analysis' terminology is more prevalent than 'Gathering' and 'Dissemination'.

Recommendations

  1. Validate the analysis qualitatively
  2. Use the 'green band' as beta testers for Data Strategy initiatives
  3. Data Strategy leadership needs to reflect on the absence of 'processing'-intensive positions.
  4. ?